COVID-19 has been dominating our thoughts, our lives, and the news for months now. As this deadly pandemic ravages the world, news outlets have been reporting that racial disparities have deadly implications for African Americans. Reports suggest that African Americans are overrepresented among infections, hospitalizations, and deaths compared to their white counterparts. This is unsurprising for countless reasons, but I wanted to dig into the data for myself. There are many different ways to approach this analysis, but for simplicity's sake, I use data reporting COVID-related deaths by county and match that to 2010 US census data reporting racial demographics by county. Here, I show data demonstrating that majority black communities are being disproportionately affected by COVID-19.
#pip install chart_studio
#pip install "notebook>=5.3"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime as dt
import seaborn as sns
import chart_studio
import chart_studio.plotly as py
import plotly
import plotly.graph_objs as go
import plotly.express as px
import plotly.io as pio
url = "https://usafactsstatic.blob.core.windows.net/public/data/covid-19/covid_deaths_usafacts.csv"
df_us_deaths = pd.read_csv(url)
Link to data: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/
The data I will be using to show COVID-19 death rates comes from USAFacts.org. USAFacts lists cumulative deaths in each county in each state of the US starting 1/22/20, and includes state and county FIPS. These codes will come in handy later for merging dataframes. USAFacts also has separate data sets for confirmed cases and population adjustments. I will be using confirmed deaths as a metric for severity, and conducting my own county-based population adjustments.
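Since the FIPS codes are what will tie the two data sets together later, here is a minimal sketch of that kind of merge. The two toy frames and all of their numbers are invented for illustration, and the sketch assumes the combined county code is the state code followed by a zero-padded three-digit county code.

```python
import pandas as pd

# Toy stand-in for the deaths table: one row per county, keyed by combined FIPS.
toy_deaths = pd.DataFrame({
    "countyFIPS": [1001, 42075],
    "Deaths": [3, 12],
})

# Toy stand-in for a census table that stores state and county codes separately.
toy_census = pd.DataFrame({
    "STATE": [1, 42],
    "COUNTY": [1, 75],
    "RESPOP": [50000, 130000],  # invented population figures
})

# Combined code = state code followed by a zero-padded 3-digit county code.
toy_census["countyFIPS"] = toy_census["STATE"] * 1000 + toy_census["COUNTY"]

# An inner merge keeps only the counties present in both tables.
toy_merged = pd.merge(toy_census, toy_deaths, on="countyFIPS")
print(toy_merged[["countyFIPS", "RESPOP", "Deaths"]])
```

Merging on an integer code avoids the mismatches that county names tend to cause (spelling, "County" suffixes, and so on).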
df_us_deaths.head() #check out the data!
df_us_deaths.shape # There are 3195 counties in the US, including unallocated territories
df_us_deaths.isnull().values.any() # yayyy we have all our data!!
#Format column names
#probably going to want dates in datetime format
df_labels = df_us_deaths.iloc[:,:4]
df_dates = df_us_deaths.iloc[:, 4:]
#convert the "m/d/yy" column names to "mm-dd-YYYY"
names_old = df_dates.columns.tolist()
names_new = [dt.strptime(i, "%m/%d/%y").strftime("%m-%d-%Y") for i in names_old]
df_dates.columns = names_new
df_us_deaths = pd.concat([df_labels, df_dates], axis = 1)
#also don't want that space in the "County Name" column
df_us_deaths = df_us_deaths.rename(columns = {"County Name":"County"})
#get rid of county info (for now)
df_state_deaths = df_us_deaths.iloc[:, 2:]
#group state rows together
df_state_deaths = df_state_deaths.drop(columns = ["stateFIPS"])
df_deaths_by_state = df_state_deaths.groupby(["State"], as_index=False).agg("sum")
#reshape data frame
df_deaths_by_state = pd.melt(df_deaths_by_state, id_vars = "State").rename(columns = {"variable": "Date", "value": "Deaths"})
#rows with all zeros (no deaths) aren't very informative for us...
df_deaths_by_state = df_deaths_by_state.loc[df_deaths_by_state["Deaths"] != 0, :]
# plot!!
plot = px.line(df_deaths_by_state,
x='Date',
y='Deaths',
color='State',
title = "Deaths by state over time",
width = 1000,
height = 700)
plotly.offline.iplot(plot)
The plots I use here are interactive through plotly. Double clicking on an item in the legend or on a line in the chart will isolate it so you can view only that data. Then, single clicking on additional states will add them to the chart for comparison. Hovering over a specific data point will give you the cumulative deaths up to that specific date in that specific state.
# Extract the total number of deaths to date per county
df_us_deaths['County,State'] = df_us_deaths[['County', 'State']].agg(', '.join, axis=1)
df_county_deaths = df_us_deaths.drop(columns = ["State", "stateFIPS", "County"])
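#keep countyFIPS, the "County,State" label, and the most recent date column (cumulative deaths to date)
df_county_totals = df_county_deaths.iloc[:, [0,-1,-2]]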
df_county_totals = df_county_totals.rename(columns = {df_county_totals.columns[-1]:"Deaths"})
#also, there are a lot of zeros so I'll take these out for visualization purposes...
df_county_totals_deaths = df_county_totals.loc[df_county_totals["Deaths"] != 0, :]
# plot!!
hist = px.histogram(df_county_totals_deaths,
x="Deaths",
nbins = 100,
log_y = True,
title = "Distribution of deaths by county",
marginal = "violin",
hover_data = ["County,State", "Deaths"],
width = 1000,
height = 600
)
hist.update_layout(yaxis_title_text = 'Number of counties')
plotly.offline.iplot(hist)
This histogram shows the distribution of counties based on COVID-related deaths. This is interesting as a snapshot, but what we really want to look at with this analysis is the racial demographics of those harder-hit counties. Time to bring in the census data...
#Get census data
al_mo_url = "https://www2.census.gov/programs-surveys/popest/datasets/2010/modified-race-data-2010/stco-mr2010_al_mo.csv"
df_al_mo = pd.read_csv(al_mo_url)
mt_wy_url = "https://www2.census.gov/programs-surveys/popest/datasets/2010/modified-race-data-2010/stco-mr2010_mt_wy.csv"
df_mt_wy = pd.read_csv(mt_wy_url, encoding = 'latin-1')
df_census = pd.concat([df_al_mo, df_mt_wy], ignore_index=True)
The data I will be using to determine racial demographics comes from the United States Census Bureau. Of note, this data comes from the last nationwide census in 2010. Subsequent census surveys have only included areas above a certain population threshold (65,000 people), which may exclude areas of interest from this analysis. Thus, until 2020 census data is publicly available, 2010 will have to do.
This data set includes information about sex, Hispanic origin, age group, and race for each county by FIPS. I am interested in looking at racial demographics, but a lot of other cool analyses could be performed with this information.
#for this project, I'm interested in race by region
df_census = df_census.drop(columns = ["SUMLEV", "SEX", "AGEGRP"])
#Gotta fix these column names too
dict_names = {"STATE":"stateFIPS",
"COUNTY":"countyFIPS",
"STNAME":"State",
"CTYNAME":"County",
"ORIGIN":"Hispanic",
"IMPRACE":"Race",
"RESPOP":"Num_res"}
df_census = df_census.rename(columns = dict_names)
#The census countyFIPS are in a different format than the USAFacts countyFIPS :(
#USAFacts uses the state FIPS followed by a zero-padded 3-digit county code, stored as an integer
df_census["stateFIPS"] = df_census["stateFIPS"].astype(int)
df_census["countyFIPS"] = df_census["stateFIPS"] * 1000 + df_census["countyFIPS"].astype(int)
#df_census.head(10)
df_census["Num_res"].sum() #about 300 million people in the US as of 2010... math checks out
There are 31 different race categories in the US census data, most of which are mixed. It would be difficult to categorize and find meaningful data based on all of these categories, so I'm going to determine which categories comprise the majority of the population and run the analysis based on these categories.
Also, census information separates Hispanic origin from race. I'm going to assign all residents who identify as having Hispanic origin to a separate category, and not include these people in the racial group they had initially chosen (i.e. "Hispanic white" --> "Hispanic", "non-Hispanic white" --> "white").
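As a rough illustration of that recoding, here is a sketch on a toy table. The rows are invented, and it assumes the coding used in this analysis, where a value of 2 in the Hispanic-origin column marks Hispanic origin and 1 marks non-Hispanic; category 0 is just the label chosen here for the new group.

```python
import pandas as pd

toy = pd.DataFrame({
    "Hispanic": [1, 2, 1, 2],   # 1 = not Hispanic, 2 = Hispanic (assumed coding)
    "Race": [1, 1, 2, 2],       # 1 = white, 2 = Black
    "Num_res": [100, 40, 30, 10],
})

# Move every Hispanic row into a separate category (0), whatever race was reported.
toy.loc[toy["Hispanic"] == 2, "Race"] = 0

# Residents per recoded category.
toy_totals = toy.groupby("Race")["Num_res"].sum()
print(toy_totals)
```

After the recode, each resident is counted in exactly one category, so the category totals still add up to the full population.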
#Create separate race category for everyone who identifies as Hispanic (Race "0")
#if the person identifies as Hispanic, add them to Race 0
df_census.loc[df_census.Hispanic == 2, "Race"] = 0
#Look at histogram of how prevalent these races are to determine what to include in analysis
df_pop_by_race = df_census.groupby(["Race"], as_index=False).sum().drop(columns = ["stateFIPS", "countyFIPS", "Hispanic"]).sort_values(by = "Num_res", ascending = False)
pop = px.bar(df_pop_by_race,
x='Race',
y='Num_res',
title = "Racial makeup of the US",
width = 1000,
height = 600)
pop.update_layout(yaxis_title_text = 'Number of residents',
xaxis_type = 'category')
plotly.offline.iplot(pop)
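Based on the above plot, races 1, 0, 2, 4, 3, 6, 8, and 7 make up the overwhelming majority of the US population, so this analysis focuses on those categories, where 0 is Hispanic (as recoded above), 1 is white, 2 is Black, 3 is American Indian, 4 is Asian, and 6, 7, and 8 are the biracial categories pairing white with Black, American Indian, and Asian, respectively: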
races = [1, 0, 2, 4, 3, 6, 8, 7]
df_census = df_census[df_census["Race"].isin(races)].drop(columns = ["Hispanic"])
To further condense our list, I'm combining the biracial categories with the corresponding non-white race. By all accounts, these people still experience racism. As such, our list will consist of just five categories: Hispanic, White, Black, American Indian, and Asian.
df_census.loc[df_census.Race == 6, "Race"] = 2
df_census.loc[df_census.Race == 7, "Race"] = 3
df_census.loc[df_census.Race == 8, "Race"] = 4
#Replace race number indicator with actual race
df_census["Race"] = df_census["Race"].replace({0: "Hispanic",
1: "White",
2: "Black",
3: "American Indian",
4: "Asian"})
#Look at US demographics based on these major groups
df_pop_by_race1 = df_census.groupby(["Race"], as_index=False).sum().drop(columns = ["stateFIPS", "countyFIPS"]).sort_values(by = "Num_res", ascending = False)
df_pop_by_race1["Percent of total population"] = ((df_pop_by_race1["Num_res"]/df_pop_by_race1["Num_res"].sum())*100).round(2)
pop = px.bar(df_pop_by_race1,
x='Race',
y='Percent of total population',
title = "US Demographics",
width = 1000,
height = 600
)
pop.update_layout(yaxis_title_text = 'Percent of total population')
plotly.offline.iplot(pop)
Here you have the US racial demographics as of 2010 based on the top 5 most prevalent race categories. These are the races that will be included in our analysis.
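For anyone who wants to see the mechanics in isolation, here is a small sketch of folding the biracial categories into the corresponding non-white category and then computing each category's share of the total. The counts are invented, and the `collapse` dictionary is just a compact stand-in for the step-by-step recoding used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Race": [1, 2, 6, 3, 7, 4, 8],
    "Num_res": [600, 120, 30, 20, 10, 50, 20],  # invented counts
})

# Fold each biracial category into the corresponding non-white category.
collapse = {6: 2, 7: 3, 8: 4}
toy["Race"] = toy["Race"].replace(collapse)

# Share of the total population held by each remaining category.
totals = toy.groupby("Race")["Num_res"].sum()
share = (totals / totals.sum() * 100).round(2)
print(share)
```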
# sum residents of each race by county
df_census_by_region = df_census.groupby(["stateFIPS", "countyFIPS", "Race"], as_index = False).agg({"Num_res":"sum"})
# Merge residents by race of each region with COVID deaths of each region
df_region_race_deaths = pd.merge(df_census_by_region, df_county_totals, on = ["countyFIPS"])
#Add percent race by county as a column
dfx = df_region_race_deaths.groupby(["countyFIPS"], as_index = False).agg({"Num_res":"sum"}).rename(columns = {"Num_res":"total_res"})
df_percents_by_county = pd.merge(df_region_race_deaths, dfx, on = "countyFIPS")
df_percents_by_county.loc[:, "percent_race"] = ((df_percents_by_county["Num_res"]/df_percents_by_county["total_res"])*100).round(2)
#Add percent death by county as a column
df_percents_by_county.loc[:, "percent_death"] = ((df_percents_by_county["Deaths"]/df_percents_by_county["total_res"])*100).round(5)
df_percents_by_county = df_percents_by_county[["County,State",'total_res','Deaths',"percent_death", 'Race','Num_res', 'percent_race']]
#pd.set_option('display.max_rows', None)
df_percents_by_county
Now we have a data frame that gives us information regarding the total number of residents by race and the total number of deaths due to COVID-19, as well as the percentages of each normalized to respective county population.
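To make those two percentages concrete, here is a sketch on a toy table with invented numbers. It uses `groupby(...).transform("sum")` to get the county total, which is an alternative to the group-then-merge approach used in this notebook and should give the same columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "county": ["A", "A", "B", "B"],
    "Race": ["White", "Black", "White", "Black"],
    "Num_res": [800, 200, 300, 700],
    "Deaths": [1, 1, 2, 2],   # county-level total, repeated on each race row
})

# County population = sum of residents over all race rows in that county.
toy["total_res"] = toy.groupby("county")["Num_res"].transform("sum")

# Share of the county population in each race category.
toy["percent_race"] = (toy["Num_res"] / toy["total_res"] * 100).round(2)

# County deaths as a percentage of county population (same value on every row of a county).
toy["percent_death"] = (toy["Deaths"] / toy["total_res"] * 100).round(5)
print(toy)
```

Note that `percent_death` describes the whole county, not any one racial group within it; the deaths data used here is not broken down by race.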
#Just interested in the demographics of my home county...
df_lebanon = df_percents_by_county.loc[df_percents_by_county["County,State"]=="Lebanon County, PA", :]
#df_lebanon
#And of DC....
df_dc = df_percents_by_county.loc[df_percents_by_county["County,State"]=="Washington, DC", :]
#df_dc
# counties that have a majority race, in order of decreasing percent death
df_majority_counties = df_percents_by_county.loc[df_percents_by_county.groupby("County,State")["percent_race"].idxmax()].drop(columns = ["total_res", "Deaths", "Num_res"]).sort_values("percent_death", ascending = False)
df_majority_counties = df_majority_counties.loc[df_majority_counties["percent_race"] > 50, :]
# pull counties with highest percent death
df_top_counties = df_majority_counties.head(10)
#pull top counties by race
df_tc_wh = df_top_counties.loc[df_top_counties["Race"] == "White",:]
df_tc_co = df_top_counties.loc[df_top_counties["Race"] != "White",:]
#calculate majority counties (again) and divide by race
df_maj_wh = df_majority_counties.loc[df_majority_counties["Race"] == "White",:]
df_maj_co = df_majority_counties.loc[df_majority_counties["Race"] != "White",:]
#calculate percentage of total counties
percent_white = (df_tc_wh.shape[0]/df_maj_wh.shape[0])*100
percent_poc = (df_tc_co.shape[0]/df_maj_co.shape[0])*100
#plot
df_tc = pd.DataFrame({"Race": ["Majority white", "Majority POC"],
"% counties in top 10 affected by COVID":[percent_white, percent_poc]})
bar4 = px.bar(df_tc,
x = "Race",
y = "% counties in top 10 affected by COVID",
color = "Race",
color_discrete_sequence=px.colors.sequential.Rainbow,
title = "Percent of counties within race in the top 10 highest death rates",
hover_data = ["% counties in top 10 affected by COVID"],
width = 600,
height = 600)
bar4.update_yaxes(ticksuffix = "%")
plotly.offline.iplot(bar4)
As of 4/29, 2.82% of US counties with a majority non-white population are among the 10 counties with the highest COVID-19 death rates (deaths as a percentage of county population). Only 0.14% of majority white counties are on this list.
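Because the two groups differ so much in size, the comparison divides by the number of counties in each group rather than just counting how many land in the top 10. A minimal sketch of that calculation follows; the counties, rates, and the helper name `share_in_top` are all invented for illustration.

```python
import pandas as pd

# Invented counties: majority group and death rate (percent of population).
toy = pd.DataFrame({
    "county": list("ABCDEFGH"),
    "majority": ["White"] * 6 + ["POC"] * 2,
    "percent_death": [0.01, 0.02, 0.30, 0.03, 0.04, 0.05, 0.25, 0.02],
})

def share_in_top(df, n):
    """Percent of each group's counties that appear among the n highest death rates."""
    top = df.nlargest(n, "percent_death")
    in_top = top["majority"].value_counts()
    group_sizes = df["majority"].value_counts()
    # Groups absent from the top n get 0 rather than NaN.
    return (in_top.reindex(group_sizes.index, fill_value=0) / group_sizes * 100).round(2)

toy_shares = share_in_top(toy, 2)
print(toy_shares)
```

In this toy case each group contributes one county to the top two, but that one county is a much larger fraction of the smaller group.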
# find the percent race based on white and black for each county
race = df_percents_by_county["Race"]
df_race = df_percents_by_county.loc[(race == "White") | (race == "Black"), :]
#plot
px.defaults.width = 800
px.defaults.height = 800
scat = px.scatter(df_race,
x = "percent_race",
y = "percent_death",
size = "percent_death",
hover_data = ["County,State", "percent_race", "percent_death"],
facet_row = "Race",
color = "percent_death",
color_continuous_scale=px.colors.sequential.Burgyl,
width = 1000
)
scat.update_yaxes(ticksuffix = "%")
scat.update_layout(xaxis_title_text = 'Percent county pop that identifies as respective race',
title = "Death rate by racial makeup of county"
)
plotly.offline.iplot(scat)
The above plot demonstrates the relationship between the percentage of the county population that is either white or black, and the percentage of the county population that died due to COVID-19. Big bubbles in the top right quadrant of either plot represent counties with a high percentage of that particular race as well as a high number of deaths relative to county population size. The plot representing people who identify as Black or African-American has several of these markers, indicating higher death rates in majority Black counties, while the plot representing people who identify as non-Hispanic white does not.
Next, we focus in on counties where the majority of the population identifies as either Black/African American or white. We can determine the majority race of a county by defining it as greater than 50% for that county. The rest of this analysis will focus on "majority black" and "majority white" counties in this way.
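Here is a minimal sketch of that majority rule on a toy table. The percentages are invented and `majority_counties` is a hypothetical helper written only for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "county": ["A", "A", "B", "B", "C", "C", "C"],
    "Race": ["White", "Black", "White", "Black", "White", "Black", "Hispanic"],
    "percent_race": [80.0, 20.0, 30.0, 70.0, 40.0, 35.0, 25.0],
})

def majority_counties(df, race, threshold=50):
    """Rows for counties where `race` makes up more than `threshold` percent of residents."""
    return df.loc[(df["Race"] == race) & (df["percent_race"] > threshold)]

toy_black_majority = majority_counties(toy, "Black")
toy_white_majority = majority_counties(toy, "White")
print(toy_black_majority["county"].tolist(), toy_white_majority["county"].tolist())
```

County C, where the largest group is only 40% of residents, falls into neither set. A strict "greater than 50%" rule leaves out counties with a plurality but no majority.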
# Returns the rows for counties where the given race (by name, e.g. "Black") is the majority, defined as > 50%
def majority_race(race):
    r = df_percents_by_county["percent_race"]
    df_majority = df_percents_by_county.loc[r > 50, :]
    df_majority = df_majority.loc[df_majority["Race"] == race, :]
    return df_majority
#create data frames based on majority race
df_black_majority = majority_race("Black")
df_white_majority = majority_race("White")
#df_hisp_majority = majority_race("Hispanic")
#df_asian_majority = majority_race("Asian")
#df_native_majority = majority_race("American Indian")
# combine majority white and majority black data frames
df_race_majority = pd.concat([df_white_majority, df_black_majority])
# plot
px.defaults.width = 800
px.defaults.height = 600
hist2 = px.histogram(df_race_majority,
x="percent_death",
nbins = 20,
log_y = True,
title = "Distribution of deaths by racial makeup of county",
color = "Race",
color_discrete_sequence=px.colors.sequential.Rainbow,
opacity = 0.8,
histnorm = "percent",
#facet_row = "Race",
hover_data = ["County,State", "percent_death", "percent_race"],
labels={'percent_death':'percent death', "percent_race":"percent race"},
marginal = "violin"
)
hist2.update_layout(xaxis_title_text = 'Percent county pop that died due to COVID-19',
yaxis_title_text = "percentage of counties")
plotly.offline.iplot(hist2)
The distributions above show the percentage of the county population that died due to COVID-19 for both black majority and white majority counties. There are 2,807 counties in the US that are majority white but only 102 that are majority black, so the y-axis has been normalized to the percentage of counties within each group. Majority black counties have death rates skewed further right than majority white counties (i.e., a larger share of them have a high percentage of COVID-related deaths).
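To see why the normalization matters, here is a small sketch of what a percent-normalized histogram computes. It uses NumPy rather than plotly, the death rates are invented, and `percent_histogram` is a hypothetical helper; it is meant in the spirit of the `histnorm = "percent"` setting, not as a description of plotly's internals.

```python
import numpy as np

# Invented death rates (percent of county population) for two groups of very different size.
rates_large_group = np.array([0.00, 0.01, 0.01, 0.02, 0.02, 0.03, 0.04, 0.12])
rates_small_group = np.array([0.02, 0.16])

bins = np.array([0.0, 0.05, 0.10, 0.15, 0.20])

def percent_histogram(values, bin_edges):
    """Histogram whose bars give the percent of observations in each bin."""
    counts, _ = np.histogram(values, bins=bin_edges)
    return counts / counts.sum() * 100

pct_large = percent_histogram(rates_large_group, bins)
pct_small = percent_histogram(rates_small_group, bins)
print(pct_large, pct_small)
```

With raw counts, the large group would dwarf the small one in every bin. Expressed as percentages, each group's bars sum to 100, so the shapes can be compared directly.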
#percent of black counties with a death rate over .1%
df_b = df_black_majority.loc[df_black_majority["percent_death"] > .1, :]
percent_b = round(((df_b.shape[0]/df_black_majority.shape[0])*100), 4)
#percent of white counties with a death rate over .1%
df_w = df_white_majority.loc[df_white_majority["percent_death"] > .1, :]
percent_w = round(((df_w.shape[0]/df_white_majority.shape[0])*100), 4)
df_bad_counties = pd.DataFrame({"Race": ["Majority white", "Majority black"],
"%counties death rate > 0.1%":[percent_w, percent_b]})
#plot and compare
px.defaults.width = 600
px.defaults.height = 600
bar2 = px.bar(df_bad_counties,
x = "Race",
y = "%counties death rate > 0.1%",
color = "Race",
color_discrete_sequence=px.colors.sequential.Rainbow,
title = "Percent of counties with a death rate > 0.1%",
hover_data = ["%counties death rate > 0.1%"])
bar2.update_yaxes(ticksuffix = "%")
plotly.offline.iplot(bar2)
As of 4/29, almost 5% of counties in the US with a majority black population have had more than 0.1% of their population die due to COVID-19. The same can be said for only 0.25% of majority white counties.
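The threshold comparison boils down to "what fraction of a group's counties exceed a cutoff". A compact sketch of that calculation, with invented rates and a hypothetical helper named `percent_above`:

```python
import pandas as pd

def percent_above(rates, threshold):
    """Percent of counties whose death rate is strictly above `threshold`."""
    rates = pd.Series(rates)
    # The mean of a boolean Series is the fraction of True values.
    return round((rates > threshold).mean() * 100, 4)

toy_rates = [0.0, 0.02, 0.05, 0.11, 0.30]  # invented death rates, in percent
toy_percent = percent_above(toy_rates, 0.1)
print(toy_percent)
```

Keep in mind that the rates here are already percentages, so a cutoff of 0.1 means 0.1% of the county population, or one death per 1,000 residents.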
#majority white counties with at least 1 death due to COVID
df_white_majority_death = df_white_majority.loc[df_white_majority["Deaths"] > 0, :].sort_values("percent_death", ascending = False)
#majority black counties with at least 1 death due to COVID
df_black_majority_death = df_black_majority.loc[df_black_majority["Deaths"] > 0, :].sort_values("percent_death", ascending = False)
#plot and compare
percent_death_white = (df_white_majority_death.shape[0]/df_white_majority.shape[0])*100
percent_death_black = (df_black_majority_death.shape[0]/df_black_majority.shape[0])*100
df_county_rates = pd.DataFrame({"Race": ["Majority white", "Majority black"],
"Percent of counties with death":[percent_death_white, percent_death_black]})
bar3 = px.bar(df_county_rates,
x = "Race",
y = "Percent of counties with death",
color = "Race",
color_discrete_sequence=px.colors.sequential.Rainbow,
title = "Percent of counties with at least one COVID-related death",
hover_data = ["Percent of counties with death"])
bar3.update_yaxes(ticksuffix = "%")
plotly.offline.iplot(bar3)
As of 4/29, 77.45% of counties in the US with a majority black population have had at least one death due to COVID-19. The same can be said for only 44.3% of majority white counties. In other words, majority black counties are about 1.75 times as likely to have seen at least one death due to COVID-19.
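For the curious, the comparison between the two groups is a simple ratio of the two percentages quoted above:

```python
# Figures quoted in the text above (as of 4/29).
pct_black_majority = 77.45
pct_white_majority = 44.3

# How many times more likely a majority black county is to have recorded a death.
ratio = pct_black_majority / pct_white_majority
print(round(ratio, 2))
```

This is a ratio of county-level proportions, so it says nothing by itself about how many deaths each county had.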
Though over 2,000 counties in the US have a population made up mostly of people who identify as non-Hispanic white, counties with majority non-white populations, specifically Black or African-American, are being disproportionately affected by COVID-19. Their overall death rates are higher, and they top the lists of counties with the highest COVID death rates. These data support what news sources are reporting regarding the issue. Of note, death rates are likely also impacted by socioeconomic status, healthcare access, and population density (people per square mile), though these issues are also linked to racial disparities.